library(tidyverse)
library(kableExtra)
library(DT)
library(plotly)
library(readxl)
library(ggthemes)
library(maps)

library(viridis) # Color gradients
library(lubridate)
library(qtlcharts) # Interactive scatter plots 

library(randomForest) 
library(ranger)       # a faster implementation of randomForest
library(caret)        # perform many machine learning models
library(broom)  # try: `install.packages("backports")` if directly installing broom gives an error

library(reshape2) # to use melt()

library(deSolve)
library(permimp) # for calculating conditional importance in a speedy way

1 Executive Summary

This report investigates factors that affected the implementation of the global COVID-19 vaccination rollout. Specifically, it examines the time lag between the 1st and 2nd vaccine dose inoculations to quantify the delay before the vaccinated population achieves full immunity. Most countries' time lags fall between one and two months, showing that the multiple-dose design of the vaccine rollout causes a considerable delay in achieving full immunity.

In addition, the influence of various demographic, socio-economic and health factors on vaccination speed and coverage is algorithmically assessed. It is found that the log of the cardiovascular death rate, the GHS index and life satisfaction have the greatest influence on the vaccine rollout index, an index that conveys both the speed and the coverage of the vaccination rollout.

These results can be utilised by epidemiologists or research students to inform their research into how to achieve greater success and efficiency with future vaccine rollouts. The analysis in this report has additionally been represented in an interactive Shiny application.

The following schematic diagram (Figure 1.1) shows the steps undertaken to investigate the two main questions in this study:

Figure 1.1: The flow chart of the project working process

2 Background and Motivation

COVID-19 vaccines have been developed and administered to the public since the 8th of December 2020. There has been strong evidence of the success of COVID-19 vaccines in reducing fatalities, severe infections, and hospitalisations (Deb et al., 2021). Given the effectiveness of vaccines, it is appropriate to evaluate potential factors affecting the speed and coverage of the vaccine rollout. If certain demographic, socio-economic or health factors can be shown to influence the vaccine rollout, this can help direct efforts to improve vaccine uptake outcomes.

The speed at which the vaccinated population was provided full immunity was affected by the multiple dose design of the vaccine rollout. One dose provides partial immunity and, after a length of time, an additional dose is required to achieve full immunity. The delay in achieving full immunity is a key factor in how quickly full immunity can be provided to a population, influencing the effectiveness of a multiple dose rollout.

2.1 Data

covid_data = read_csv("data/covid_data_latest.csv")
covid_data$date = ymd(covid_data$date)

policy = read_csv("data/covid-vaccination-policy.csv")
policy$Day = ymd(policy$Day)


happiness = read_csv("data/happiness-cantril-ladder.csv")
ghs = read_csv("data/2021-GHS-Index-April-2022.csv")
corruption = read_xlsx("data/CPI2020_GlobalTablesTS_210125.xlsx", sheet=1, skip=2)

source("data/ct_model.R")

The COVID-19 time-series dataset provided by the ‘Our World in Data’ online publication (Hannah Ritchie & Roser, 2020) was the primary dataset analysed. This dataset contains data on 207 countries, with measured attributes such as COVID-19 cases, deaths, tests and vaccinations. A COVID vaccination policy dataset from the same source (Hannah Ritchie & Roser, 2020) was also explored in this report, providing time-series data for COVID-19 vaccination availability in 207 countries, represented as a categorical index.

The following datasets were used to provide predictive factors for the vaccine rollout:

  1. The Global Health Security (GHS) Index data was sourced from the GHS Index repository. It measures the capacity of 195 countries to prepare for pandemics, covering 2021 to 2022. (“GHS Index Report and Data,” 2022)

  2. The ‘Our World in Data: Happiness and Life Satisfaction’ dataset contains variables measuring happiness and life satisfaction for 160 countries. (Ortiz-Ospina & Roser, 2013)

  3. The Corruption Perceptions Index dataset contains data on perceptions of public sector corruption. The index ranges from 0 to 100, where 0 corresponds to maximal corruption. (“2020 Corruption Perceptions Index - Explore the Results,” 2021)

3 Methods

3.1 Time lag between 1st and 2nd dose

To calculate the time lag between a vaccinated population receiving the 1st and 2nd vaccine doses, the difference in days between the people vaccinated (at least one dose) and people fully vaccinated (two doses) time-series attributes was calculated. This was achieved by repeatedly shifting one time-series curve against the other until the two curves, which both follow logistic functions, roughly overlap; at that point the distance between them is minimised. Three Minkowski distances were computed: Manhattan (p = 1), Euclidean (p = 2) and Minkowski with p = 3. The number of day-shifts required to achieve the minimal distance is the time lag between the two time-series attributes.

The time lag was calculated for each country over the period 1st of February 2021 to 1st of August 2021, excluding countries with a population below 1,500,000 or with inadequate vaccine uptake data.

# smoother function, returns smoothed column

Lowess <- function(data, f) {
  lowess_fit <- lowess(data, f = f)
  return(lowess_fit$y)
}
lag_covid = covid_data
lag_covid = lag_covid %>%
  filter(population > 1500000) %>%
  filter(gdp_per_capita > 0)
countrys <- unique(lag_covid$location)
deleted <- c("Afghanistan", "Antigua and Barbuda", "Bangladesh","Benin", "Bhutan", "Bonaire Sint Eustatius and Saba", "Botswana", "Burundi","Burkina Faso", "Cameroon", "Cote d'Ivoire", "Democratic Republic of Congo", "Ethiopia","Eritrea", "Gabon", "Ghana", "Guernsey", "Guinea", "Kenya", "Kuwait", "Liberia", "Laos", "Namibia", "Nepal","Nicaragua", "Niger", "Nigeria", "Palestine", "Philippines", "Pitcairn", "Rwanda", "Saint Helena", "Senegal", "Sierra Leone", "Somalia", "South Sudan", "Sudan", "Tokelau", "Turkmenistan","Tanzania", "Uganda","Yemen", "World", "Zambia")
countrys = countrys[! countrys %in% deleted]
lag_covid = select(lag_covid, "date", "location", "people_vaccinated_per_hundred", "people_fully_vaccinated_per_hundred")
start_date = "2021-02-01"
end_date = "2021-08-01"
lag_covid = lag_covid %>% filter(date >= start_date & date < end_date)
lag_covid$people_vaccinated_per_hundred[is.na(lag_covid$people_vaccinated_per_hundred)] = 0
lag_covid$people_fully_vaccinated_per_hundred[is.na(lag_covid$people_fully_vaccinated_per_hundred)] = 0

lagValue <- function(FirstDose, SecondDose, windowsize, p)
{
  # vector for all measures of distance between matrices
  dist_vector = c()
  i = 1
  while (i <= windowsize){
    # select different subsets of matrices, calculate the distances between the 2 matrices and store the distance. This while loop will contain information for 1st vaccine lag
    FirstDose_subset <- FirstDose[i:nrow(FirstDose),1]
    SecondDose_subset <- SecondDose[1:(1 - i + nrow(SecondDose)),1]
    dist_FirstDose <- proxy::dist(t(FirstDose_subset), t(SecondDose_subset), method = "Minkowski", p = p)
    dist_vector = c(dist_vector, dist_FirstDose)
    i = i + 1
  }
  
  
  j = 1
  while (j <= windowsize){
    # select different subsets of matrices, calculate the distances between the 2 matrices and store the distance. This while loop will contain information for 2nd vaccine lag
    FirstDose_subset1 <- FirstDose[1:(1 - j + nrow(FirstDose)),1]
    SecondDose_subset1 <- SecondDose[j:nrow(SecondDose),1]
    dist_SecondDose <- proxy::dist(t(FirstDose_subset1), t(SecondDose_subset1), method = "Minkowski", p = p)
    dist_vector = c(dist_vector, dist_SecondDose)
    j = j + 1
  }
  
  # select min value index which corresponds to value of the lag
  return(which.min(dist_vector))
}  
lag_vector_1 <- c()
lag_vector_2 <- c()
lag_vector_3 <- c()
# loop through each country and calculate 3 time lags: manhattan distance, euclidean distance and minkowski (p=3) time lags
p = 1
while (p <= 3){
  z = 1
  while (z <= length(countrys)){
    # only select records for certain country and only select 1st and 2nd vaccine columns
    lagCovid_filtered = filter(lag_covid, location == countrys[z])
    combined_matrix <- cbind(lagCovid_filtered[,3], lagCovid_filtered[,4])
    
    # The dataset contains missing values, recorded as 0. Replace each 0 with
    # the previous day's value, for both the 1st and 2nd dose columns.
    for (col in 1:2){
      for (i in 2:nrow(combined_matrix)){
        if (combined_matrix[i, col] == 0){
          combined_matrix[i, col] = combined_matrix[i - 1, col]
        }
      }
    }
    
    # Apply smoothing function to 1st and 2nd vaccine columns. f = 0.15 is an arbitrary value
    
    combined_matrix_smooth<- as.matrix(apply(combined_matrix, 2, Lowess, f = 0.15))
    
    # Store each column separately as individual matrices
    FirstDose_matrix = as.matrix(combined_matrix_smooth[,1])
    SecondDose_matrix = as.matrix(combined_matrix_smooth[,2])
    
    # Input the individual matrices into the lagValue function to find the lag between the 1st and 2nd dose for a particular country
    lag <- lagValue(FirstDose_matrix, SecondDose_matrix, windowsize=100, p)
    
    #store value of lag
    if (p == 1){
      lag_vector_1 <- c(lag_vector_1, lag)
    } else if (p == 2){
      lag_vector_2 <- c(lag_vector_2, lag)
    } else {
      lag_vector_3 <- c(lag_vector_3, lag)
    }
    z = z + 1
  }
  p = p + 1
}
# label the lag values with the corresponding country
names(lag_vector_1) <- countrys
names(lag_vector_2) <- countrys
names(lag_vector_3) <- countrys
# function to explain the time lag value

lagType <- function(lag, windowsize)
{ # Convert the index returned by lagValue into a time lag in days.
  # Index values greater than windowsize came from the 2nd half of 'dist_vector' in the lagValue function, the half corresponding to the 2nd dose lag.
  # For those, subtract the windowsize entries contributed by the 1st half to recover the number of days for the 2nd dose lag.
  # No such adjustment is needed for the 1st dose lag, whose values all lie in the first half.
  if (lag > windowsize){
    return(c(LagType = "Second Dose Lag", Lag = lag - windowsize - 1))
  } else {
    return(c(LagType = "First Dose Lag", Lag = lag - 1))
  }
}
# Apply function to each country's Time lag value 
lag_df_1 = mapply(lagType, lag = lag_vector_1, windowsize = 100)
lag_df_2 = mapply(lagType, lag = lag_vector_2, windowsize = 100)
lag_df_3 = mapply(lagType, lag = lag_vector_3, windowsize = 100)

# Visualise Time lags

total_lag = cbind(t(lag_df_1), t(lag_df_2), t(lag_df_3))
colnames(total_lag) = c("LagType", "Lag: Manhattan distance", "LagType", "Lag: Euclidean distance", "LagType", "Lag: Minkowski distance (p=3)") # lag_df_1/2/3 correspond to p = 1 (Manhattan), p = 2 (Euclidean) and p = 3

3.2 Vaccine rollout index (VRI) and variable importance

3.2.1 Exploration

The vaccine availability policy data for each country contains an index representing which groups of people can access vaccines, with larger index values representing greater access. For each stage of availability, the speed of vaccination was derived from the slope of the number of people vaccinated over time in that period. As evident in the bar chart (Figure 3.1) below, stages 3 and 4 have the highest speeds while stage 5 has a lower speed. Although stage 5 allows the greatest access, the decrease in vaccination speed can be attributed to a significant proportion of the population already being vaccinated during stages 3 and 4. Greater availability therefore does not necessarily correspond to higher vaccination speed; rather, other factors affect vaccination uptake.

# Country list
countries = c("United States", "India", "Brazil", "France", "United Kingdom", "Russia", 
              "Turkey", "Italy", "Germany", "Spain", "Argentina",
              "Iran", "Colombia", "Poland", "Mexico", "Netherlands", 
              "Indonesia", "Ukraine", "South Africa", "Philippines", "Peru", "Belgium",
              "Czechia", "Japan")

# Select certain countries
policy_data <- covid_data %>% 
  filter(location %in% countries)

policy <- policy %>% filter(Entity %in% countries)

policy_data <- policy_data %>% 
  select(iso_code, continent, location, date,new_people_vaccinated_smoothed_per_hundred,  new_people_vaccinated_smoothed, new_vaccinations, new_vaccinations_smoothed, new_vaccinations_smoothed_per_million, population, people_vaccinated)

# Merge the policy data with the COVID data
policy_data <- merge(policy_data, policy, by.x = c('iso_code', "date", "location"), by.y = c('Code', 'Day', "Entity"))
# The 'get_slope' function calculates, for each vaccination policy stage (1 to 5), the slope of the share of people vaccinated over time using lm(). This slope is the speed of vaccine uptake under that policy stage and is later compared across countries in a graph.
get_slope <- function(country_data, country_policy, country) {
  i = 1
  country_slope <- data.frame(policy = as.factor(1:5), speed = rep(0, 5), country = rep(country, 5))
  
  while (i <= length(country_policy$vaccination_policy)) {
    
    # Select current policy
    current_policy = country_policy$vaccination_policy[i]
    
    if (current_policy > 0 & !is.na(country_policy$people_vaccinated[country_policy$vaccination_policy == current_policy])) {
      
      # Extract the number of people vaccinated
      temp_people <- country_data$people_vaccinated[country_data$vaccination_policy == current_policy & !is.na(country_data$people_vaccinated)] / country_data$population[1]

      temp_index <- 1:length(temp_people)
      
      if (length(temp_index) > 1) {
        # Linear model
        mod <- lm(temp_people ~ temp_index)
        
        # Extract and store the slope
        country_slope$speed[country_slope$policy == current_policy] <- summary(mod)$coefficients[2,1]
      } else {
        # Some countries only have a policy for 1 day -> treat the slope as 0
        country_slope$speed[country_slope$policy == current_policy] <- 0
      }
      
    } 
    
    i <- i + 1
    
  }
  
  return(country_slope)
}
country_slope = NULL
#loop through the countries in the dataset and subset the required variables in another dataframe
for (country in countries) {
  country_data <- policy_data[policy_data$location == country, ]
  
  country_policy <- country_data[!is.na(country_data$people_vaccinated), ] %>%
    group_by(vaccination_policy) %>%
    arrange(date) %>%
    slice(1L) %>%
    select(iso_code, location, date, vaccination_policy, new_vaccinations_smoothed, population, people_vaccinated)
  
#Apply the 'get_slope' function to calculate the speed of each country then store in another vector
  temp_slope <- get_slope(country_data, country_policy, country)

  country_slope <- rbind(country_slope, temp_slope)
}

country_slope$policy <- as.factor(country_slope$policy)
# Plot the vaccination speed per availability stage for each country
global <- ggplot(data = country_slope, aes(x = country, y = speed, fill = policy)) + 
  geom_bar(stat="identity", position=position_dodge()) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
  scale_y_sqrt() + 
  ggtitle("All countries") + 
  xlab("Countries") + 
  ylab("Slope") + 
  labs(fill = "Availability stage")

ggplotly(global) %>% 
  layout(autosize = F, width = '100%', height = '100%')

Figure 3.1: The vaccination speed for different policy periods for selected countries

3.2.1.1 VRI

An analysis of how factors affect vaccine uptake requires a measure of vaccine uptake. Inspired by M. Kazemi’s research (Kazemi, Bragazzi, & Kong, 2022), we used a vaccine rollout index (VRI) to represent both the vaccination uptake rate (speed) and the coverage in each country. It is defined as

\[VRI = r \cdot \frac{d}{N}\]

where \(r\) is the vaccination uptake rate, \(d\) is the total number of people vaccinated, and \(N\) is the total population. The ratio \(d/N\) measures the coverage of vaccination while \(r\) measures the speed. Two methods were used to estimate \(r\).
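As a worked illustration of the definition (the helper below is hypothetical, not part of the report's code):

```r
# Hypothetical helper illustrating the VRI definition: r is the estimated
# uptake rate, d the total number of people vaccinated, N the population.
vri <- function(r, d, N) {
  r * d / N
}

# Two illustrative countries with identical coverage (70%) but different
# uptake rates: the faster rollout receives the higher index.
vri(r = 0.05, d = 7e6, N = 1e7)  # 0.035
vri(r = 0.02, d = 7e6, N = 1e7)  # 0.014
```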

3.2.1.2 Logistic growth model

It is observed that in most countries the number of people vaccinated roughly follows a logistic growth model, which is consistent with the observations of the aforementioned report (Kazemi et al., 2022). Hence, a logistic growth curve was fitted to the people vaccinated attribute by non-linear least squares:

\[c(t) = \frac{K}{1+a{e}^{-rt}}\]

where \(c(t)\) is the number of people vaccinated as a function of time \(t\), \(K\) is the number of people vaccinated at the end of the recorded period (the asymptote) and \(r\) is the vaccination uptake rate. The \(r\) is estimated as the inverse of the parameter scal of the non-linear least squares model fitted in R.
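A minimal sketch of this fit on simulated counts (the report fits the real data through ct_model(); the variable names and simulated values here are illustrative). The self-starting model stats::SSlogis parameterises the curve as Asym / (1 + exp((xmid - t) / scal)), so the uptake rate is recovered as 1 / scal:

```r
# Simulated logistic uptake for one country: K = 5e6 people, true r = 0.05
set.seed(1)
t <- 1:180
people_vaccinated <- 5e6 / (1 + 200 * exp(-0.05 * t)) + rnorm(180, sd = 2e4)

# Self-starting logistic fit; r is the inverse of the 'scal' parameter
fit   <- nls(people_vaccinated ~ SSlogis(t, Asym, xmid, scal))
r_hat <- 1 / coef(fit)[["scal"]]
round(r_hat, 3)  # close to the true rate of 0.05
```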

3.2.1.3 Asymptotic regression

A second approach was investigated to estimate the vaccination uptake rate. The log of the people vaccinated attribute follows an asymptotic regression, in which the rate of change is a function of the log of the number of people vaccinated:

\[ \log (c(t)) = K + (\log(c(0)) - K) e^{-rt} \]

\[ \frac{d}{dt} \log (c(t)) = r \left( K - \log (c(t)) \right) \]

Here \(c(t)\) and \(c(0)\) are as defined above, \(K\) is the asymptote on the log scale, and \(r\) is the vaccine uptake rate. The \(r\) is estimated as the exponential of the parameter lrc of the non-linear least squares model fitted in R.
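A corresponding sketch on the log scale, again on simulated data with illustrative names. The self-starting model stats::SSasymp parameterises the curve as Asym + (R0 - Asym) * exp(-exp(lrc) * t), so the uptake rate is recovered as exp(lrc):

```r
# Simulated log-scale uptake: asymptote K = log(5e6), true r = 0.04
set.seed(1)
t <- 1:180
K <- log(5e6)
log_people <- K + (log(1e3) - K) * exp(-0.04 * t) + rnorm(180, sd = 0.05)

# Self-starting asymptotic fit; r is the exponential of the 'lrc' parameter
fit   <- nls(log_people ~ SSasymp(t, Asym, R0, lrc))
r_hat <- exp(coef(fit)[["lrc"]])
round(r_hat, 3)  # close to the true rate of 0.04
```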

# Example of use, inspect the r_list to see what's included
r_list1 = ct_model(covid_data, log.y = FALSE, model = "logis")

r_list2 = ct_model(covid_data, log.y = TRUE, model = "asymp")

vri_data <- data.frame(bind_rows(r_list1$vri_data))

The residual standard error (RSE) was used to assess the fit of the models to the people vaccinated attribute. The mean RSE over all the locations is \(5.0218802 \times 10^{5}\) for the logistic model, which is lower than the \(1.0581885 \times 10^{6}\) for the asymptotic regression. Therefore, the logistic model was the more appropriate method for estimating the vaccination uptake parameter.

3.2.1.4 Selecting predictors of VRI and Preprocessing

Multiple socio-economic, demographic and health-related factors have been considered as potential predictors of a country’s VRI (Ataguba, Ojo, & Ichoku, 2016). The full list of predictors used is documented in the Appendix (Table A.1). A log transformation was applied to some attributes to mitigate the problem of highly skewed data. The different scales of the variables necessitated min-max normalisation to restrict all variables to between 0 and 1.

For each predictor, we investigated its relationship with VRI using a scatter plot as well as a Spearman correlation heatmap. All plots are located in the Appendix (Figure A.1).

# Merge the datasets
corruption <- corruption %>% select(c("ISO3", "CPI score 2020")) 

happiness <- happiness %>% group_by(Code) %>%
  arrange(Year) %>%
  filter(!is.na(`Life satisfaction in Cantril Ladder (World Happiness Report 2021)`)) %>%
  dplyr::summarise(satisfaction=last(`Life satisfaction in Cantril Ladder (World Happiness Report 2021)`)) 


joint_data <- merge(vri_data,corruption,  by.x=c("iso_code"), by.y=c("ISO3"))

joint_data <- merge(joint_data,happiness, by.x=c("iso_code"), by.y=c("Code"))

# GHE-INDEX
ghs <- ghs %>% filter(Year == 2021) %>% select(Country,`OVERALL SCORE`)

## adding iso code to ghs dataset
ghs <- ghs %>% mutate(iso_code = iso.alpha(Country, 3), .before = 1)
ghs <- ghs %>% mutate(
  iso_code = case_when(
    !is.na(iso_code) ~ iso_code,
    Country == "Bosnia and Hercegovina" ~ "BIH",
    Country == "Cabo Verde" ~ "CPV",
    Country == "Congo (Brazzaville)" ~ "COG",
    Country == "Congo (Democratic Republic)" ~ "COD",
    Country == "Côte d'Ivoire" ~ "CIV",
    Country == "eSwatini" ~ "SWZ",
    Country == "Kyrgyz Republic" ~ "KGZ",
    Country == "São Tomé and Príncipe" ~ "STP",
    Country == "St Kitts & Nevis" ~ "KNA",
    Country == "St Lucia" ~ "LCA",
    Country == "St Vincent & The Grenadines" ~ "VCT",
    Country == "United Kingdom" ~ "GBR",
    Country == "United States of America" ~ "USA"
  )
)


joint_data <- merge(joint_data, ghs[,-2],  by.x=c("iso_code"), by.y=c("iso_code"))
colnames(joint_data)[16] <- "GHS_score"
# take log for highly skewed variables
vri_extra_logged = joint_data %>% select(-population) %>%
  mutate(
    across(.cols = c(population_density, aged_65_older, gdp_per_capita, hospital_beds_per_thousand, cardiovasc_death_rate, extreme_poverty, diabetes_prevalence),
           ~log(.))
  ) %>% 
  rename_with(.cols = c(population_density, aged_65_older, gdp_per_capita, hospital_beds_per_thousand, cardiovasc_death_rate, extreme_poverty, diabetes_prevalence),
              .fn = ~paste0("log_", .))


min_max_norm <- function(x) {
    (x - min(x,na.rm = TRUE)) / (max(x,na.rm = TRUE) - min(x,na.rm = TRUE))
}

scaled_data <- data.frame(lapply(vri_extra_logged[4:15], min_max_norm), vri = vri_extra_logged$vri)

scaled_data_cleaned <- scaled_data %>% filter( !is.na(vri) & !is.na(log_population_density)& 
                                    !is.na(log_gdp_per_capita) &!is.na(log_aged_65_older)&
                                      !is.na(log_extreme_poverty)&
                                      !is.na(human_development_index)& 
                                      !is.na(log_cardiovasc_death_rate)& 
                                      !is.na(log_hospital_beds_per_thousand)) %>% 
                                 select(c(vri,log_gdp_per_capita,log_aged_65_older,log_population_density,
                                     CPI.score.2020,log_extreme_poverty,
                                     satisfaction,life_expectancy,human_development_index,
                                     log_cardiovasc_death_rate,log_diabetes_prevalence,
                                     log_hospital_beds_per_thousand, GHS_score)) 
# Appendix

spearman <- cor(scaled_data_cleaned, use="pairwise.complete.obs", method="spearman")

heatmap <- qtlcharts::iplotCorr(
  scaled_data_cleaned,
  reorder=TRUE,
  corr = spearman,
  chartOpts = list(cortitle = "Spearman's correlation")
)
# Remove population_density based on scatter plot and correlation heatmap

scaled_data_cleaned <- scaled_data_cleaned %>% select(-log_population_density)

3.2.1.5 Random forest regression and variable importance

Random forest (RF) regression is a non-parametric algorithm that can handle feature selection problems with numerous highly correlated variables (Dewi, 2019). The bagging and ensemble techniques used in RF also reduce the likelihood of the model being adversely affected by outliers. Further, the calculation of variable importance in RF covers not only the impact of each predictor variable individually but also its interactions with other predictor variables (Strobl, Boulesteix, Kneib, Augustin, & Zeileis, 2008). Hence, RF is an appropriate algorithm for modelling the VRI with the predictor attributes. The association between predictor attributes and the VRI was measured by calculating their conditional permutation importance (CPI). The original permutation importance is measured by the mean decrease in error of a RF before and after permuting a predictor's values, reflecting that predictor's contribution to modelling the response variable. CPI enhances the permutation approach: each predictor is permuted conditionally on the variables to which it is significantly related, according to a user-defined threshold in [0, 1], thereby reducing the bias resulting from correlated covariates when computing variable importance (D. Debeer & Strobl, 2020).

# hyper parameter grid search 
## The importance results are produced on mac
mtry <-  seq(2, 6, by = 1)
num_trees <- c(100,150,200,250,300,350,400,450,500)


# Manual Search
#control <- trainControl(method="cv", number=3, search="grid")
grid_param <- expand.grid(.mtry=mtry)
modellist <- list()
set.seed(3888)
for (ntree in num_trees) {
    fit <- train( vri~., 
                      data= scaled_data_cleaned, 
                      method='rf', 
                      tuneGrid=grid_param, 
                        ntree= ntree,
                        importance=TRUE,
                      trControl=trainControl(method='cv', 
                        number=5) )
    key <- toString(ntree)
    modellist[[key]] <- fit$finalModel

}


## COMPARE RESULTS

lowest_mse <- 1
model_mse <- modellist[[1]]
highest_r2 <- 0
model_r2 <- modellist[[1]]
  
for (i in c(1:length(modellist))) {

  result <- predict(modellist[[i]], newdata=scaled_data_cleaned %>% select(-vri))
  result_avg <- mean(scaled_data_cleaned$vri)
  mse = mean((scaled_data_cleaned$vri - result)^2)
  r2 = 1 - (sum((scaled_data_cleaned$vri - result)^2))/(sum((scaled_data_cleaned$vri - result_avg)^2))
  if (highest_r2 < r2){
     highest_r2 = r2
     model_r2 = modellist[[i]]
  }
  if (lowest_mse > mse) {
    lowest_mse = mse
    model_mse = modellist[[i]]
  }

 
}

# Calculating importance of features to the model: Handling multicollinear features by using conditional permutation

## https://cran.r-project.org/web/packages/permimp/vignettes/permimp-package.html

set.seed(3888)
optimal_rf <- randomForest(vri ~ ., data= scaled_data_cleaned, mtry = 2, keep.forest = TRUE, keep.inbag = FALSE, ntree= 300)
## compute permutation importance
rf_permimp <- permimp(optimal_rf, progressBar = FALSE, conditional = TRUE,scaled = TRUE, thresholdDiagnostics = TRUE,type="response",do_check = FALSE)

vimp <- as.data.frame(rf_permimp$values)
colnames(vimp)[1] <- "importance"
vimp <- round(vimp, 4)
vimp$var <- row.names(vimp)

options(scipen=999)
vimp <- vimp %>%
  arrange(desc(importance)) %>%
  slice(1:5)
vimp <- vimp %>%
  ggplot(aes(reorder(var,importance), importance)) +
  geom_col(fill="steelblue") +
  coord_flip() +
  theme_bw()+
  ggtitle("Top 5 important variables") +
  xlab("Factors in order") +
  ylab("Scaled importance")


## The permimp() function is sensitive to operating systems, so we save the results of importance
# png(filename="importance.png")
# vimp
# dev.off()

A hyper-parameter grid search with 5-fold cross-validation was run to find the optimal RF hyper-parameters (the number of trees and the number of features considered at each split), based on mean squared error and \(R^2\) values.

4 Results and Evaluation

4.1 Time Lag

The median time lag across the 112 countries was 40.5 days. Cambodia had the smallest time lag: 18 days. The maximum value of 99 days (the largest lag measurable with the 100-day window) occurred for multiple countries: Algeria, Congo, Libya, Sri Lanka and Malawi. The interquartile range, a measure of the spread of the time lag values, was 31.5 days. The time lag values for individual countries are shown in Table 4.1 below.

# datatable(total_lag, options = list(columnDefs = list(list(className = 'dt-center', targets = "_all"))), caption = "List of different time lags measures for each country.")

total_lag %>% 
  kbl(caption = "List of different time lags measures for each country.",
      format = "html", 
      booktabs = TRUE, 
      row.names = TRUE,   # label each row with its country
      escape = TRUE,
      align = "c") %>% 
  kable_styling(bootstrap_options = c("condensed", "striped", "hover"),
                position = "center") %>% 
  scroll_box(height = "400px")
Table 4.1: List of different time lags measures for each country.
[Rendered output: a scrollable table with one row per country, giving the lag type and the lag in days under the Manhattan, Euclidean and Minkowski (p = 3) distance measures.]
Second Dose Lag 28 Second Dose Lag 29 Second Dose Lag 30
Second Dose Lag 35 Second Dose Lag 33 Second Dose Lag 33
Second Dose Lag 41 Second Dose Lag 44 Second Dose Lag 45
Second Dose Lag 33 Second Dose Lag 32 Second Dose Lag 32
Second Dose Lag 87 Second Dose Lag 88 Second Dose Lag 88
Second Dose Lag 37 Second Dose Lag 40 Second Dose Lag 40


In evaluating the stability of the time lag derivation model, it is noted that the model is robust to the choice of distance metric: time lags computed with the Manhattan distance and the Minkowski distance (p = 3) were similar to those computed with the Euclidean distance (Figure 4.1).
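For reference, all three metrics are instances of the standard Minkowski distance between two aligned series \(x\) and \(y\) of length \(n\):

```latex
d_p(x, y) = \left( \sum_{t=1}^{n} \lvert x_t - y_t \rvert^{p} \right)^{1/p}
```

with \(p = 1\) giving the Manhattan distance, \(p = 2\) the Euclidean distance, and \(p = 3\) the third metric tested.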

# Box plots of time lags for all 3 types of time lags measurements

total_lag_df <- as.data.frame(total_lag)
# Columns alternate a label column and a lag-value column for each metric;
# the label columns get distinct names to avoid duplicated column names.
colnames(total_lag_df) <- c("LagType1", "LagEuclideanDistance", "LagType2", "LagManhattanDistance", "LagType3", "LagMinkowskiDistanceP3")

box <- plot_ly(type = "box")
box <- box %>% add_boxplot(x = total_lag_df$LagManhattanDistance, boxpoints = "all", jitter = 0.3, name = "Manhattan Distance")
box <- box %>% add_boxplot(x = total_lag_df$LagEuclideanDistance, boxpoints = "all", jitter = 0.3, name = "Euclidean Distance")
box <- box %>% add_boxplot(x = total_lag_df$LagMinkowskiDistanceP3, boxpoints = "all", jitter = 0.3, name = "Minkowski Distance (P=3)")
box <- box %>% layout(title = "Box Plot of Lag Between 1st and 2nd Dose", autosize = F, width = '100%', height = '100%')
box

Figure 4.1: Boxplot results of time lags for each distance metric.

Despite the different ways of calculating distance, the model recovers roughly equivalent time lags between the two time series, supporting the stability of the lag estimates.
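The shift-and-compare idea behind the lag derivation can be sketched as follows. This is a minimal illustration, not the report's actual code: `minkowski` and `estimate_lag` are made-up names, and the search assumes a simple exhaustive scan over candidate shifts.

```r
# Minkowski distance between two equal-length vectors
# (p = 1: Manhattan, p = 2: Euclidean).
minkowski <- function(a, b, p = 2) sum(abs(a - b)^p)^(1 / p)

# Estimate the lag (in days) by shifting the second-dose series back by k
# days and finding the shift that minimises the distance to the
# first-dose series.
estimate_lag <- function(first_dose, second_dose, max_lag = 120, p = 2) {
  lags <- seq_len(max_lag)
  dists <- sapply(lags, function(k) {
    n <- length(first_dose) - k
    minkowski(first_dose[1:n], second_dose[(k + 1):(k + n)], p)
  })
  lags[which.min(dists)]
}
```

On a synthetic second-dose curve that is the first-dose curve delayed by 30 days, all three values of p recover the same lag, which is the robustness property discussed above.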

4.2 VRI

4.2.1 Results of VRI

A summary of VRIs is given in Table A.2 in the appendix. The seven right-side outliers (A.2) were: Nauru (0.152543), Saint Helena (0.143903), Gibraltar (0.052261), Nicaragua (0.043231), San Marino (0.040336), Mongolia (0.039360) and Seychelles (0.039159). The countries with the lowest VRIs are Burundi (0.000039), Yemen (0.000180), and Haiti (0.000411).

4.2.2 Results of the optimal RF and variable importance

The random forest regressor with optimal hyper-parameters involved building 300 trees with 2 variables considered at a split. The MSE and R squared were approximately \(9.38 × 10^{-6}\) and 0.8852 respectively. The top 5 important predictors were:

  1. log of death rate from cardiovascular disease
  2. Global Health Security (GHS) Index
  3. life satisfaction
  4. log of diabetes prevalence
  5. log of life expectancy at birth

Figure 4.2: The top 5 important variables

4.2.3 Evaluation

Evaluation of model accuracy involved assessing the mean squared error (MSE) and the coefficient of determination \(R^2\). The MSE measures how close predictions are to the observed values, while \(R^2\) represents the proportion of the variance in the response (the VRI) that is explained by the predictors. The lower the MSE and the higher the \(R^2\), the more accurate the model.
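In symbols, with observed values \(y_i\), predictions \(\hat{y}_i\), and mean observed value \(\bar{y}\), these are the standard definitions, matching the manual calculations in the evaluation code:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```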

#VRI 
start <- min(covid_data$date[!is.na(covid_data$people_vaccinated)])
end <- start %m+% months(4)

r_list4 = ct_model(covid_data[start <= covid_data$date & covid_data$date < end, ], log.y = FALSE, model = "logis")
temp_data <- data.frame(bind_rows(r_list4$vri_data))
temp_data <- data.frame(vri4 = temp_data$vri, temp_data %>% select(-vri))
vri_eva_data <- temp_data


start <- min(covid_data$date[!is.na(covid_data$people_vaccinated)])
start <- start %m+% months(4)
end <- start %m+% months(8)

r_list8 = ct_model(covid_data[start <= covid_data$date & covid_data$date < end, ], log.y = FALSE, model = "logis")
temp_data <- data.frame(bind_rows(r_list8$vri_data))
temp_data <- data.frame(vri8 = temp_data$vri, location = temp_data$location)
vri_eva_data <- merge(vri_eva_data, temp_data, by = "location")

start <- min(covid_data$date[!is.na(covid_data$people_vaccinated)])
start <- start %m+% months(8)

# No upper bound: this period runs from 8 months after the start of
# vaccinations to the latest recorded date.
r_listf = ct_model(covid_data[start <= covid_data$date, ], log.y = FALSE, model = "logis")
temp_data <- data.frame(bind_rows(r_listf$vri_data))
temp_data <- data.frame(vrif = temp_data$vri, location = temp_data$location)
vri_eva_data <- merge(vri_eva_data, temp_data, by = "location")

As the VRI is highly time-dependent, the time period chosen as input can affect the results ("input affects output"). The period between the commencement of vaccinations and the last recorded day varies from 1 to 1.5 years across countries in the COVID dataset. Therefore, the dataset was separated into \(0-4\) months, \(4-8\) months, and \(8-\text{latest}\) periods, each starting on the first day a country had vaccination data. For each selected period, the accuracy was compared using the MSE and \(R^2\).

# Merge the datasets
joint_data <- merge(vri_eva_data,corruption,  by.x=c("iso_code"), by.y=c("ISO3"))

joint_data <- merge(joint_data,happiness, by.x=c("iso_code"), by.y=c("Code"))

ghs <- ghs %>% mutate(
  iso_code = case_when(
    !is.na(iso_code) ~ iso_code,
    Country == "Bosnia and Hercegovina" ~ "BIH",
    Country == "Cabo Verde" ~ "CPV",
    Country == "Congo (Brazzaville)" ~ "COG",
    Country == "Congo (Democratic Republic)" ~ "COD",
    Country == "Côte d'Ivoire" ~ "CIV",
    Country == "eSwatini" ~ "SWZ",  # SWZ, not BIH (BIH is Bosnia and Herzegovina)
    Country == "Kyrgyz Republic" ~ "KGZ",
    Country == "São Tomé and Príncipe" ~ "STP",
    Country == "St Kitts & Nevis" ~ "KNA",
    Country == "St Lucia" ~ "LCA",
    Country == "St Vincent & The Grenadines" ~ "VCT",
    Country == "United Kingdom" ~ "GBR",
    Country == "United States of America" ~ "USA"
  )
)


joint_data <- merge(joint_data, ghs[,-2],  by.x=c("iso_code"), by.y=c("iso_code"))
colnames(joint_data)[18] <- "GHS_score"  # position-based rename of the merged GHS index column

The dataset between 4 and 8 months has the highest \(R^2\) (0.83) and the lowest MSE (0.000013). For input between 0 and 4 months and from 8 months to the latest day, \(R^2\) is similar, approximately 0.78. However, the 8-months-to-latest data has the highest MSE (0.000031), while the 0-to-4-months data's MSE is 0.000014. Accordingly, it is concluded that the time period of the input data does affect the model output, which is most accurate on data between 4 and 8 months (Appendix: Table A.3).

# take log for highly skewed variables
vri_extra_logged = joint_data %>% select(-population) %>%
  mutate(
    across(.cols = c(population_density, aged_65_older, gdp_per_capita, hospital_beds_per_thousand, cardiovasc_death_rate, extreme_poverty, diabetes_prevalence),
           ~log(.))
  ) %>% 
  rename_with(.cols = c(population_density, aged_65_older, gdp_per_capita, hospital_beds_per_thousand, cardiovasc_death_rate, extreme_poverty, diabetes_prevalence),
              .fn = ~paste0("log_", .))

# Min-max normalisation: rescale a numeric vector to the [0, 1] range
min_max_norm <- function(x) {
    (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

scaled_data <- data.frame(lapply(vri_extra_logged %>% select(-iso_code, -location, -vri4, -vri8, -vrif), min_max_norm), vri4 = vri_extra_logged$vri4, vri8 = vri_extra_logged$vri8, vrif = vri_extra_logged$vrif)

# Removing NAs:
scaled_data_cleaned <- scaled_data %>% filter(!is.na(vri4) &
                                                !is.na(vri8) &
                                                !is.na(vrif) &
                                                !is.na(log_population_density) &
                                                !is.na(log_gdp_per_capita) &
                                                !is.na(log_aged_65_older)&
                                                !is.na(log_extreme_poverty)&
                                                !is.na(human_development_index)&
                                                !is.na(log_cardiovasc_death_rate)&
                                                !is.na(log_hospital_beds_per_thousand)) %>%
  select(c(vri4, vri8, vrif, log_gdp_per_capita,log_aged_65_older,log_population_density,CPI.score.2020,
           log_extreme_poverty,satisfaction,life_expectancy,human_development_index,
           log_cardiovasc_death_rate,log_diabetes_prevalence,log_hospital_beds_per_thousand, GHS_score)) 


# Hyper-parameter grid search
mtry <- seq(2, 12, by = 1)
num_trees <- c(100, 150, 200, 250, 300, 350, 400, 450, 500)

# VRI between 0 to 4 months
# Manual Search
grid_param <- expand.grid(.mtry=mtry)
modellist <- list()
for (ntree in num_trees) {
    fit <- train( vri4~., 
                      data= scaled_data_cleaned %>% select(-vri8, -vrif), 
                      method='rf', 
                      tuneGrid=grid_param, 
                        ntree= ntree,
                      trControl=trainControl(method='cv', 
                        number=3) )
    key <- toString(ntree)
    modellist[[key]] <- fit$finalModel

}


## COMPARE RESULTS
lowest_mse_4 <- 1
model_mse_4 <- modellist[[1]]
highest_r2_4 <- 0
model_r2_4 <- modellist[[1]]

for (i in c(1:length(modellist))) {

  result <- predict(modellist[[i]], newdata=scaled_data_cleaned %>% select(-vri4, -vri8, -vrif))
  result_avg <- mean(scaled_data_cleaned$vri4)
  mse = mean((scaled_data_cleaned$vri4 - result)^2)
  r2 = 1 - (sum((scaled_data_cleaned$vri4 - result)^2))/(sum((scaled_data_cleaned$vri4 - result_avg)^2))
  if (highest_r2_4 < r2){
     highest_r2_4 = r2
     model_r2_4 = modellist[[i]]
  }
  if (lowest_mse_4 > mse) {
    lowest_mse_4 = mse
    model_mse_4 = modellist[[i]]
  }
 
}


# VRI between 4 to 8 months

# Manual Search
grid_param <- expand.grid(.mtry=mtry)
modellist <- list()
for (ntree in num_trees) {
    fit <- train( vri8~., 
                      data= scaled_data_cleaned %>% select(-vri4, -vrif), 
                      method='rf', 
                      tuneGrid=grid_param, 
                        ntree= ntree,
                      trControl=trainControl(method='cv', 
                        number=3) )
    key <- toString(ntree)
    modellist[[key]] <- fit$finalModel

}


## COMPARE RESULTS

lowest_mse_8 <- 1
model_mse_8 <- modellist[[1]]
highest_r2_8 <- 0
model_r2_8 <- modellist[[1]]

for (i in c(1:length(modellist))) {

  result <- predict(modellist[[i]], newdata=scaled_data_cleaned %>% select(-vri4, -vri8, -vrif))
  result_avg <- mean(scaled_data_cleaned$vri8)
  mse = mean((scaled_data_cleaned$vri8 - result)^2)
  r2 = 1 - (sum((scaled_data_cleaned$vri8 - result)^2))/(sum((scaled_data_cleaned$vri8 - result_avg)^2))
  if (highest_r2_8 < r2){
     highest_r2_8 = r2
     model_r2_8 = modellist[[i]]
  }
  if (lowest_mse_8 > mse) {
    lowest_mse_8 = mse
    model_mse_8 = modellist[[i]]
  }
 
}


# VRI between 8 months to latest date

# Manual Search
grid_param <- expand.grid(.mtry=mtry)
modellist <- list()
for (ntree in num_trees) {
    fit <- train( vrif~., 
                      data= scaled_data_cleaned %>% select(-vri8, -vri4), 
                      method='rf', 
                      tuneGrid=grid_param, 
                        ntree= ntree,
                      trControl=trainControl(method='cv', 
                        number=3) )
    key <- toString(ntree)
    modellist[[key]] <- fit$finalModel

}


## COMPARE RESULTS

lowest_mse_f <- 1
model_mse_f <- modellist[[1]]
highest_r2_f <- 0
model_r2_f <- modellist[[1]]

for (i in c(1:length(modellist))) {

  result <- predict(modellist[[i]], newdata=scaled_data_cleaned %>% select(-vri4, -vri8, -vrif))
  result_avg <- mean(scaled_data_cleaned$vrif)
  mse = mean((scaled_data_cleaned$vrif - result)^2)
  r2 = 1 - (sum((scaled_data_cleaned$vrif - result)^2))/(sum((scaled_data_cleaned$vrif - result_avg)^2))
  if (highest_r2_f < r2){
     highest_r2_f = r2
     model_r2_f = modellist[[i]]
  }
  if (lowest_mse_f > mse) {
    lowest_mse_f = mse
    model_mse_f = modellist[[i]]
  }
 
}


vri_evaluation <- data.frame(`Time period` = c("0 to 4 months", "4 to 8 months", "8 months to current"), MSE = c(lowest_mse_4, lowest_mse_8, lowest_mse_f), `R_squared` = c(highest_r2_4, highest_r2_8, highest_r2_f))

vri_eval_table <- vri_evaluation

4.2.4 Shiny App Results

The results of our project and some of the analysis processes are illustrated in the Shiny app deployed on shinyapps.io. Two heatmap layers are plotted on an interactive world map to compare results between locations: the "VRI Overlay" shows the VRI values (scaled into 0-1 using logistic regression) and the "Time Lag Overlay" shows the time lags (days). Hover-over labels display the specific values.

The panel on the right shows the scaled VRI values for all locations; the table can be sorted to show the locations with the highest or lowest scaled VRI. The bottom panel shows location-specific plots and the time lag table for all locations. Users can either click a country on the map or search and select a country from the dropdown menu to see the location-specific information (default: Australia). The time lag graph (left) shows the trends of people vaccinated and people fully vaccinated, so the user can see whether the time lag between the two doses changes over time and can compare our calculated time lag with the graph. The vaccination trend graph (right) shows how well our logistic growth model fits the people-vaccinated time series, so the user can judge how reliable the VRI is.

5 Discussion and Limitations

5.1 Time Lag

The time lags between the 1st and 2nd vaccine doses vary considerably between countries. This can be mainly attributed to the brand of vaccine with which the majority of a country's population is inoculated. Different brands had different recommended waiting times between doses: the AstraZeneca vaccine had a longer waiting period (12 weeks) than the Pfizer and Moderna vaccines (3 and 4 weeks respectively). Accordingly, the United Kingdom, whose population is mostly inoculated with the AstraZeneca vaccine, has a time lag of 76 days, in contrast with the 28-day time lag of the United States, whose population is mostly inoculated with the Pfizer and Moderna vaccines.

76 days is a substantial amount of time to be without full immunity, especially in contrast with 28 days. Hence, the judgment can be made that if the United Kingdom, or other countries that had long time lags due to the AstraZeneca vaccine, wanted to achieve full immunity of their vaccinated population as quickly as possible, they should have invested in Pfizer/Moderna vaccines.

Countries between the minimum and the first quartile had time lags of 18 to 30.5 days. This corresponds to 3-4 weeks and reflects the expected time lag of countries that inoculated their populations with mostly Pfizer and Moderna vaccines. While these time lags are considerably shorter than those of other countries, they still reveal a considerable period in which vaccinated people are not yet fully protected from COVID-19. This identifies a flaw in the multiple-dose rollout and encourages epidemiologists and researchers to explore methods of shortening the waiting periods of a multiple-dose rollout or producing single-dose vaccines that provide full immunity.

5.2 VRI

The results of variable importance show that the death rate from cardiovascular disease has the most influence on the VRI, being distinctly more important than the other predictors. This reflects that countries that perform comparatively well in managing health issues like cardiovascular disease are also likely to perform well in ensuring their population is immunized. The Global Health Security Index and life satisfaction are the next two most important variables, indicating that countries that are prepared for epidemics, and where the majority of people are satisfied with their lives, will have better vaccine rollout outcomes.

5.3 Limitations

  1. Some countries, e.g. the Democratic Republic of Congo, have excessive missing data for some variables, making it impossible to calculate time lags or VRIs, so the analysis cannot cover all countries. This effect worsens as external datasets, which cover fewer countries, are merged in.

  2. The discovered factors affecting vaccine uptake may not accurately represent causal effects, as they are model-based results that change when hyperparameters, user-defined thresholds and even operating systems vary. Further, there may be factors outside the datasets used that also affect vaccine uptake; however, no available source covers such variables as comprehensively as the datasets used, which further limits the analysis of the research question.

  3. With an aim of calculating importance instead of predicting VRI, all the data was used as the training set, which could lead to overfitting.
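As a possible remedy for the third limitation, a simple hold-out split could be applied before tuning. The following is a minimal sketch under that assumption; `split_holdout` is an illustrative helper, not part of the report's code:

```r
# Minimal 80/20 hold-out split. Tuning on `train` and reporting MSE and
# R-squared on `test` would give a less optimistic accuracy estimate than
# training and evaluating on all of the data.
split_holdout <- function(df, train_frac = 0.8, seed = 42) {
  set.seed(seed)
  idx <- sample(nrow(df), size = floor(train_frac * nrow(df)))
  list(train = df[idx, , drop = FALSE], test = df[-idx, , drop = FALSE])
}
```

Variable importance could then still be computed on the training portion, with the test portion reserved purely for accuracy reporting.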

5.4 Future Improvements

Additional datasets containing other factors, such as vaccine type, trust in health institutions, public perception and individual lifestyle, could be collected to test their effect on vaccine uptake. Exploring more factors would give a more holistic analysis of vaccination uptake.

The conditional permutation importance is sensitive to the choice of hyperparameters and user-defined threshold. Although setting sensible values has proved difficult, one future improvement is to evaluate how changes to these pre-set values affect the final result.

6 Conclusion

The analysis of the time lag between the first and second vaccine doses found that most countries' lags fall between one and two months, confirming that the multiple-dose nature of the rollout causes a considerable delay in the vaccinated population achieving full immunity.

The analysis of the influence of demographic, socio-economic and health factors on vaccination speed and coverage found the following factors most influential: the log of the cardiovascular disease death rate, the GHS Index and life satisfaction. This suggests that positive outcomes regarding these factors will correlate with positive vaccination outcomes.

The constructed Shiny application allows users to compare time lags between countries, trends in vaccination speed and vaccine availability, and the accuracy of the measures of vaccine uptake.

Future studies can build on this report by integrating new and more diverse data to explore more relationships between vaccination outcomes and various health, demographic and socio-economic factors.

7 Student Contributions

Everyone researched and came up with ideas during exploration, some of which were recorded on our GitHub repository. David mainly developed the idea for the sub-questions while Sylvia researched and proposed ideas for VRI.

The model of Time lag was mainly coded and formulated by David, including visualizing and evaluating the results. Aurora and Tina collaborated on the data exploration on VRI based on vaccination availability, while Sylvia built the Random Forest model and calculated the variable importance with data cleaning helped by Vighnesh. Diana further refined the r estimation models, their evaluation, and the VRI calculation. Aurora evaluated the model on the effect of input time selection. Harun created clusters for VRI. Diana wrote all the Shiny app code and Vighnesh and Sylvia wrote the About page.

For the report, Vighnesh wrote the executive summary and collaborated with Tina on the Background and Data section. David contributed majorly to the time lag and result section. For the VRI, Sylvia wrote the methods, results, and discussion while Aurora wrote about exploration and evaluation. Diana wrote the goodness-of-fit evaluation for the asymptotic model and the Shiny app section. Harun wrote the limitations and future improvements along with proofreading the report.

8 References

World Health Organization. (n.d.). Science in 5, episode 26: Vaccine dosage. Retrieved from https://www.who.int/emergencies/diseases/novel-coronavirus-2019/media-resources/science-in-5/episode-26---vaccine-dosage
2020 corruption perceptions index - explore the results. (2021). Retrieved from https://www.transparency.org/en/cpi/2020
Ataguba, J. E., Ojo, K. O., & Ichoku, H. E. (2016). Explaining socio-economic inequalities in immunization coverage in Nigeria. Health Policy and Planning, 31(9), 1212–1224. https://doi.org/10.1093/heapol/czw053
Braveman, P., & Gottlieb, L. (2014). The social determinants of health: It’s time to consider the causes of the causes. Public Health Reports, 129(1_suppl2), 19–31. https://doi.org/10.1177/00333549141291S206
Broman, K. W. (2015). R/qtlcharts: Interactive graphics for quantitative trait locus mapping. Genetics, 199, 359–361. https://doi.org/10.1534/genetics.114.172742
Deb, P., Furceri, D., Jiménez, D., Kothari, S., Ostry, J., & Tawk, N. (2021). The effects of COVID-19 vaccines on economic activity. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.4026476
Debeer, Dries, Hothorn, T., & Strobl, C. (2021). Permimp: Conditional permutation importance. Retrieved from https://CRAN.R-project.org/package=permimp
Debeer, D., & Strobl, C. (2020). Conditional permutation importance revisited [Journal Article]. BMC Bioinformatics, 21(1), 307. https://doi.org/10.1186/s12859-020-03622-2
Dewi, C. (2019). Random forest and support vector machine on features selection for regression analysis. International Journal of Innovative Computing, Information & Control: IJICIC, 15, 2027–2037.
E.Gornick, M. (2002). Committee on guidance for designing a national healthcare disparities report. In S. E. K. (Ed.), 2, MEASURING THE EFFECTS OF SOCIOECONOMIC STATUS ON HEALTH CARE. Washington (DC): National Academies Press (US). Retrieved from https://www.ncbi.nlm.nih.gov/books/NBK221050/
Eilers, R., Krabbe, P. F., & Melker, H. E. de. (2014). Factors affecting the uptake of vaccination by the elderly in western society [Journal Article]. Prev Med, 69, 224–234. https://doi.org/10.1016/j.ypmed.2014.10.017
GHS index report and data. (2022). Retrieved from https://www.ghsindex.org/report-model/
Glassman, A. (n.d.). The COVID-19 vaccine rollout was the fastest in global history, but low-income countries were left behind. Retrieved from https://www.cgdev.org/blog/covid-19-vaccine-rollout-was-fastest-global-history-low-income-countries-were-left-behind
Grolemund, G., & Wickham, H. (2011). Dates and times made easy with lubridate. Journal of Statistical Software, 40(3), 1–25. Retrieved from https://www.jstatsoft.org/v40/i03/
Hannah Ritchie, L. R.-G., Edouard Mathieu, & Roser, M. (2020). Coronavirus pandemic (COVID-19). Our World in Data. Retrieved from https://ourworldindata.org/coronavirus
Kazemi, M., Bragazzi, N. L., & Kong, J. D. (2022). Assessing inequities in COVID-19 vaccine roll-out strategy programs: A cross-country study using a machine learning approach. Vaccines, 10(2). https://doi.org/10.3390/vaccines10020194
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22. Retrieved from https://CRAN.R-project.org/doc/Rnews/
Maleva, T. M., Kartseva, M. A., & Korzhuk, S. V. (2021). Socio-demographic determinants of COVID-19 vaccine uptake in russia in the context of mandatory vaccination of employees. Population and Economics, 5(4), 30–49. https://doi.org/10.3897/popecon.5.e77832
Ortiz-Ospina, E., & Roser, M. (2013). Happiness and life satisfaction. Our World in Data.
R Core Team. (2022). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
Robinson, D., Hayes, A., & Couch, S. (2022). Broom: Convert statistical objects into tidy tibbles. Retrieved from https://CRAN.R-project.org/package=broom
Schmelzer, M. (2017). HTML widgets alignment in rmarkdwon. Retrieved from https://stackoverflow.com/questions/44458961/html-widgets-alignment-in-rmarkdwon
Sievert, C. (2020). Interactive web-based data visualization with r, plotly, and shiny. Chapman; Hall/CRC. Retrieved from https://plotly-r.com
Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests [Journal Article]. BMC Bioinformatics, 9(1), 307. https://doi.org/10.1186/1471-2105-9-307
Tarantola, D., & Dasgupta, N. (2021). COVID-19 surveillance data: A primer for epidemiology and data science. American Journal of Public Health, 111(4), 614–619. https://doi.org/10.2105/AJPH.2020.306088
Wickham, H. (2007). Reshaping data with the reshape package. Journal of Statistical Software, 21(12), 1–20. Retrieved from http://www.jstatsoft.org/v21/i12/
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wickham, H., & Bryan, J. (2022). Readxl: Read excel files. Retrieved from https://CRAN.R-project.org/package=readxl
Wright, M. N., & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), 1–17. https://doi.org/10.18637/jss.v077.i01
Xie, Y., Cheng, J., & Tan, X. (2022). DT: A wrapper of the JavaScript library ’DataTables’. Retrieved from https://CRAN.R-project.org/package=DT
Zhu, H. (2022). kableExtra: Construct complex table with ’kable’ and pipe syntax.

Appendix

A VRI

A.1 All predictors used

var_list <- data.frame(colnames(joint_data)[5:16])
colnames(var_list) <- "Selected Variable Name"
var_list %>% 
  kbl(caption = "Selected variables for random forest.",
      format = "html", 
      booktabs = TRUE, 
      row.names = FALSE, 
      escape = TRUE) %>% 
  kable_styling(bootstrap_options = c("condensed", "striped", "hover"),
                position = "center")
Table A.1: Selected variables for random forest.
Selected Variable Name
aged_65_older
gdp_per_capita
cardiovasc_death_rate
population_density
hospital_beds_per_thousand
human_development_index
extreme_poverty
diabetes_prevalence
life_expectancy
vri8
vrif
CPI score 2020

A.2 Correlation matrix between predictors and VRI

[Interactive heatmap widget]

Figure A.1: Interactive Spearman's correlation heatmap and scatter plots between variables and VRI

A.3 VRI results summary

df_sum <- vri_data %>% select(location, vri)

df_sum %>% 
  kbl(caption = "VRI results for all countries",
      format = "html", 
      booktabs = TRUE, 
      row.names = TRUE, 
      escape = TRUE,
      digits = 4) %>% 
  kable_styling(bootstrap_options = c("condensed", "striped", "hover"),
                position = "center") %>% 
  scroll_box(height = "500px")
Table A.2: VRI results for all countries
location vri
1 Afghanistan 0.0024
2 Albania 0.0061
3 Algeria 0.0036
4 Andorra 0.0202
5 Angola 0.0069
6 Anguilla 0.0180
7 Antigua and Barbuda 0.0044
8 Argentina 0.0175
9 Armenia 0.0101
10 Aruba 0.0134
11 Australia 0.0189
12 Austria 0.0176
13 Azerbaijan 0.0118
14 Bahamas 0.0071
15 Bahrain 0.0182
16 Bangladesh 0.0135
17 Barbados 0.0062
18 Belarus 0.0102
19 Belgium 0.0246
20 Belize 0.0121
21 Benin 0.0089
22 Bermuda 0.0211
23 Bhutan 0.0036
24 Bolivia 0.0086
25 Bonaire Sint Eustatius and Saba NA
26 Bosnia and Herzegovina 0.0064
27 Botswana 0.0104
28 Brazil 0.0186
29 British Virgin Islands 0.0109
30 Brunei 0.0264
31 Bulgaria 0.0045
32 Burkina Faso 0.0033
33 Burundi 0.0000
34 Cambodia 0.0254
35 Cameroon 0.0006
36 Canada 0.0294
37 Cape Verde 0.0197
38 Cayman Islands 0.0188
39 Central African Republic 0.0026
40 Chad NA
41 Chile 0.0169
42 China 0.0180
43 Colombia 0.0149
44 Comoros 0.0068
45 Congo 0.0024
46 Cook Islands 0.0071
47 Costa Rica 0.0190
48 Cote d’Ivoire 0.0038
49 Croatia 0.0096
50 Cuba 0.0272
51 Curacao 0.0223
52 Cyprus 0.0193
53 Czechia 0.0172
54 Democratic Republic of Congo NA
55 Denmark 0.0236
56 Djibouti 0.0024
57 Dominica 0.0040
58 Dominican Republic 0.0170
59 Ecuador 0.0232
60 Egypt 0.0096
61 El Salvador 0.0178
62 Equatorial Guinea 0.0036
63 Estonia 0.0131
64 Eswatini 0.0076
65 Ethiopia 0.0026
66 Faeroe Islands 0.0219
67 Falkland Islands 0.0087
68 Fiji 0.0236
69 Finland 0.0209
70 France 0.0202
71 French Polynesia 0.0118
72 Gabon 0.0020
73 Gambia 0.0027
74 Georgia 0.0082
75 Germany 0.0206
76 Ghana 0.0071
77 Gibraltar 0.0523
78 Greece 0.0150
79 Greenland 0.0267
80 Grenada 0.0050
81 Guatemala 0.0107
82 Guernsey 0.0214
83 Guinea 0.0038
84 Guinea-Bissau 0.0088
85 Guyana 0.0087
86 Haiti 0.0004
87 Honduras 0.0133
88 Hong Kong 0.0131
89 Hungary 0.0256
90 Iceland 0.0308
91 India 0.0119
92 Indonesia 0.0124
93 Iran 0.0277
94 Iraq 0.0050
95 Ireland 0.0216
96 Isle of Man 0.0369
97 Israel 0.0229
98 Italy 0.0207
99 Jamaica 0.0050
100 Japan 0.0255
101 Jersey 0.0132
102 Jordan 0.0093
103 Kazakhstan 0.0111
104 Kenya 0.0024
105 Kiribati 0.0191
106 Kuwait 0.0145
107 Kyrgyzstan 0.0032
108 Laos 0.0135
109 Latvia 0.0117
110 Lebanon 0.0056
111 Lesotho 0.0113
112 Liberia 0.0058
113 Libya 0.0057
114 Liechtenstein 0.0185
115 Lithuania 0.0152
116 Luxembourg 0.0199
117 Macao 0.0140
118 Madagascar 0.0007
119 Malawi 0.0009
120 Malaysia 0.0324
121 Maldives 0.0198
122 Mali 0.0011
123 Malta 0.0236
124 Mauritania 0.0074
125 Mauritius 0.0214
126 Mexico 0.0132
127 Moldova 0.0060
128 Monaco 0.0105
129 Mongolia 0.0394
130 Montenegro 0.0089
131 Montserrat 0.0020
132 Morocco 0.0114
133 Mozambique 0.0108
134 Myanmar 0.0078
135 Namibia 0.0033
136 Nauru 0.1525
137 Nepal 0.0106
138 Netherlands 0.0248
139 New Caledonia 0.0119
140 New Zealand 0.0250
141 Nicaragua 0.0432
142 Niger 0.0012
143 Nigeria 0.0013
144 Niue 0.0085
145 North Macedonia 0.0118
146 Norway 0.0201
147 Oman 0.0214
148 Pakistan 0.0088
149 Palestine 0.0077
150 Panama 0.0218
151 Papua New Guinea 0.0007
152 Paraguay 0.0178
153 Peru 0.0167
154 Philippines 0.0124
155 Pitcairn NA
156 Poland 0.0157
157 Portugal 0.0234
158 Qatar 0.0175
159 Romania 0.0131
160 Russia 0.0080
161 Rwanda 0.0184
162 Saint Helena 0.1439
163 Saint Kitts and Nevis 0.0127
164 Saint Lucia 0.0034
165 Saint Vincent and the Grenadines 0.0026
166 Samoa 0.0129
167 San Marino 0.0403
168 Sao Tome and Principe 0.0091
169 Saudi Arabia 0.0144
170 Senegal 0.0015
171 Serbia 0.0112
172 Seychelles 0.0392
173 Sierra Leone 0.0039
174 Singapore 0.0248
175 Sint Maarten (Dutch part) 0.0141
176 Slovakia 0.0115
177 Slovenia 0.0126
178 Solomon Islands 0.0065
179 Somalia 0.0022
180 South Africa 0.0081
181 South Korea 0.0233
182 South Sudan 0.0009
183 Spain 0.0226
184 Sri Lanka 0.0295
185 Sudan 0.0019
186 Suriname 0.0112
187 Sweden 0.0189
188 Switzerland 0.0169
189 Syria 0.0027
190 Taiwan 0.0234
191 Tajikistan 0.0095
192 Tanzania NA
193 Thailand 0.0198
194 Timor 0.0111
195 Togo 0.0033
196 Tokelau NA
197 Tonga 0.0127
198 Trinidad and Tobago 0.0145
199 Tunisia 0.0146
200 Turkey 0.0156
201 Turkmenistan NA
202 Turks and Caicos Islands 0.0116
203 Tuvalu 0.0153
204 Uganda 0.0085
205 Ukraine 0.0075
206 United Arab Emirates 0.0137
207 United Kingdom 0.0161
208 United States 0.0140
209 Uruguay 0.0258
210 Uzbekistan 0.0118
211 Vanuatu 0.0107
212 Venezuela 0.0173
213 Vietnam 0.0297
214 Wallis and Futuna 0.0046
215 Yemen 0.0002
216 Zambia 0.0026
217 Zimbabwe 0.0057
vri_bplot <- ggplot(vri_data,aes(x="",y=vri)) + geom_boxplot(outlier.colour="blue") + ylab("VRI") + xlab("") + ggtitle("Distribution of VRIs") 
ggplotly(vri_bplot) %>% 
  layout(autosize = F, width = '100%', height = '100%')

Figure A.2: Boxplot of VRI values

A.4 VRI time evaluation results

vri_eval_table %>% 
  kbl(caption = "VRI time evaluation result",
      format = "html", 
      booktabs = TRUE, 
      row.names = FALSE, 
      escape = TRUE) %>% 
  kable_styling(bootstrap_options = c("condensed", "striped", "hover"),
                position = "center")
Table A.3: VRI time evaluation result
Time.period MSE R_squared
0 to 4 months 0.0000147 0.7615992
4 to 8 months 0.0000131 0.8260440
8 months to current 0.0000315 0.7924668
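The MSE and R² columns in Table A.3 follow the standard definitions used in the evaluation code (Appendix B.2), where \(y_i\) are the observed VRIs, \(\hat{y}_i\) the random forest predictions, and \(\bar{y}\) the mean observed VRI:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$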

B Code & Session Info

B.1 Session info

sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## system code page: 936
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] permimp_1.0-2      deSolve_1.32       reshape2_1.4.4     broom_0.7.12      
##  [5] caret_6.0-90       lattice_0.20-45    ranger_0.13.1      randomForest_4.7-1
##  [9] qtlcharts_0.16     lubridate_1.8.0    viridis_0.6.2      viridisLite_0.4.0 
## [13] maps_3.4.0         ggthemes_4.2.4     readxl_1.3.1       plotly_4.10.0     
## [17] DT_0.21            kableExtra_1.3.4   forcats_0.5.1      stringr_1.4.0     
## [21] dplyr_1.0.8        purrr_0.3.4        readr_2.1.2        tidyr_1.2.0       
## [25] tibble_3.1.6       ggplot2_3.3.5      tidyverse_1.3.1   
## 
## loaded via a namespace (and not attached):
##   [1] TH.data_1.1-1        colorspace_2.0-2     modeltools_0.2-23   
##   [4] ellipsis_0.3.2       class_7.3-20         fs_1.5.2            
##   [7] proxy_0.4-26         rstudioapi_0.13      listenv_0.8.0       
##  [10] mvtnorm_1.1-3        prodlim_2019.11.13   fansi_1.0.2         
##  [13] coin_1.4-2           xml2_1.3.3           codetools_0.2-18    
##  [16] splines_4.1.3        libcoin_1.0-9        knitr_1.37          
##  [19] jsonlite_1.7.3       pROC_1.18.0          dbplyr_2.1.1        
##  [22] compiler_4.1.3       httr_1.4.2           backports_1.4.1     
##  [25] assertthat_0.2.1     Matrix_1.4-0         fastmap_1.1.0       
##  [28] lazyeval_0.2.2       strucchange_1.5-2    cli_3.1.1           
##  [31] htmltools_0.5.2      tools_4.1.3          qtl_1.50            
##  [34] gtable_0.3.0         glue_1.6.1           Rcpp_1.0.8          
##  [37] cellranger_1.1.0     jquerylib_0.1.4      vctrs_0.3.8         
##  [40] svglite_2.1.0        nlme_3.1-155         crosstalk_1.2.0     
##  [43] iterators_1.0.14     timeDate_3043.102    xfun_0.29           
##  [46] gower_1.0.0          globals_0.14.0       rvest_1.0.2         
##  [49] lifecycle_1.0.1      future_1.24.0        zoo_1.8-9           
##  [52] MASS_7.3-55          scales_1.1.1         ipred_0.9-12        
##  [55] hms_1.1.1            sandwich_3.0-1       parallel_4.1.3      
##  [58] yaml_2.2.2           gridExtra_2.3        sass_0.4.0          
##  [61] rpart_4.1.16         stringi_1.7.6        highr_0.9           
##  [64] foreach_1.5.2        hardhat_0.2.0        lava_1.6.10         
##  [67] matrixStats_0.62.0   rlang_1.0.1          pkgconfig_2.0.3     
##  [70] systemfonts_1.0.4    evaluate_0.15        labeling_0.4.2      
##  [73] recipes_0.2.0        htmlwidgets_1.5.4    tidyselect_1.1.2    
##  [76] parallelly_1.30.0    plyr_1.8.6           magrittr_2.0.2      
##  [79] bookdown_0.24        R6_2.5.1             generics_0.1.2      
##  [82] multcomp_1.4-19      DBI_1.1.2            pillar_1.7.0        
##  [85] haven_2.4.3          withr_2.4.3          survival_3.2-13     
##  [88] nnet_7.3-17          future.apply_1.8.1   modelr_0.1.8        
##  [91] crayon_1.5.0         utf8_1.2.2           party_1.3-10        
##  [94] tzdb_0.2.0           rmarkdown_2.11       grid_4.1.3          
##  [97] data.table_1.14.2    ModelMetrics_1.2.2.2 reprex_2.0.1        
## [100] digest_0.6.29        webshot_0.5.3        stats4_4.1.3        
## [103] munsell_0.5.0        bslib_0.3.1

B.2 All code

labs = knitr::all_labels()
# exclude chunk "setup" and "get-labels" (this chunk)
labs = setdiff(labs, c("setup", "get-labels"))
# can be used later if we want all the code in one chunk
library(tidyverse)
library(kableExtra)
library(DT)
library(plotly)
library(readxl)
library(ggthemes)
library(maps)

library(viridis) # Color gradients
library(lubridate)
library(qtlcharts) # Interactive scatter plots 

library(randomForest) 
library(ranger)       # a faster implementation of randomForest
library(caret)        # perform many machine learning models
library(broom)  # try: `install.packages("backports")` if directly installing broom gives an error

library(reshape2) # to use melt()

library(deSolve)
library(permimp) # for calculating conditional importance in a speedy way
covid_data = read_csv("data/covid_data_latest.csv")
covid_data$date = ymd(covid_data$date)

policy = read_csv("data/covid-vaccination-policy.csv")
policy$Day = ymd(policy$Day)


happiness = read_csv("data/happiness-cantril-ladder.csv")
ghs = read_csv("data/2021-GHS-Index-April-2022.csv")
corruption = read_xlsx("data/CPI2020_GlobalTablesTS_210125.xlsx", sheet=1, skip=2)

source("data/ct_model.R")
# Example of use, inspect the r_list to see what's included
r_list1 = ct_model(covid_data, log.y = FALSE, model = "logis")

r_list2 = ct_model(covid_data, log.y = TRUE, model = "asymp")

vri_data <- data.frame(bind_rows(r_list1$vri_data))
# Merge the datasets
corruption <- corruption %>% select(c("ISO3", "CPI score 2020")) 

happiness <- happiness %>% group_by(Code) %>%
  arrange(Year) %>%
  filter(!is.na(`Life satisfaction in Cantril Ladder (World Happiness Report 2021)`)) %>%
  dplyr::summarise(satisfaction=last(`Life satisfaction in Cantril Ladder (World Happiness Report 2021)`)) 


joint_data <- merge(vri_data,corruption,  by.x=c("iso_code"), by.y=c("ISO3"))

joint_data <- merge(joint_data,happiness, by.x=c("iso_code"), by.y=c("Code"))

# GHS Index
ghs <- ghs %>% filter(Year == 2021) %>% select(Country,`OVERALL SCORE`)

## adding iso code to ghs dataset
ghs <- ghs %>% mutate(iso_code = iso.alpha(Country, 3), .before = 1)
ghs <- ghs %>% mutate(
  iso_code = case_when(
    !is.na(iso_code) ~ iso_code,
    Country == "Bosnia and Hercegovina" ~ "BIH",
    Country == "Cabo Verde" ~ "CPV",
    Country == "Congo (Brazzaville)" ~ "COG",
    Country == "Congo (Democratic Republic)" ~ "COD",
    Country == "Côte d'Ivoire" ~ "CIV",
    Country == "eSwatini" ~ "SWZ",
    Country == "Kyrgyz Republic" ~ "KGZ",
    Country == "São Tomé and Príncipe" ~ "STP",
    Country == "St Kitts & Nevis" ~ "KNA",
    Country == "St Lucia" ~ "LCA",
    Country == "St Vincent & The Grenadines" ~ "VCT",
    Country == "United Kingdom" ~ "GBR",
    Country == "United States of America" ~ "USA"
  )
)


joint_data <- merge(joint_data, ghs[,-2],  by.x=c("iso_code"), by.y=c("iso_code"))
colnames(joint_data)[16] <- "GHS_score"
# take log for highly skewed variables
vri_extra_logged = joint_data %>% select(-population) %>%
  mutate(
    across(.cols = c(population_density, aged_65_older, gdp_per_capita, hospital_beds_per_thousand, cardiovasc_death_rate, extreme_poverty, diabetes_prevalence),
           ~log(.))
  ) %>% 
  rename_with(.cols = c(population_density, aged_65_older, gdp_per_capita, hospital_beds_per_thousand, cardiovasc_death_rate, extreme_poverty, diabetes_prevalence),
              .fn = ~paste0("log_", .))


min_max_norm <- function(x) {
    (x - min(x,na.rm = TRUE)) / (max(x,na.rm = TRUE) - min(x,na.rm = TRUE))
}
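# Note: min_max_norm() rescales a numeric vector linearly onto [0, 1],
#   x' = (x - min(x)) / (max(x) - min(x)),
# so every predictor enters the model on a comparable scale. For example,
# min_max_norm(c(2, 4, 6)) returns 0.0 0.5 1.0.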

scaled_data <- data.frame(lapply(vri_extra_logged[4:15], min_max_norm), vri = vri_extra_logged$vri)

scaled_data_cleaned <- scaled_data %>%
  filter(!is.na(vri) &
           !is.na(log_population_density) &
           !is.na(log_gdp_per_capita) &
           !is.na(log_aged_65_older) &
           !is.na(log_extreme_poverty) &
           !is.na(human_development_index) &
           !is.na(log_cardiovasc_death_rate) &
           !is.na(log_hospital_beds_per_thousand)) %>%
  select(c(vri, log_gdp_per_capita, log_aged_65_older, log_population_density,
           CPI.score.2020, log_extreme_poverty,
           satisfaction, life_expectancy, human_development_index,
           log_cardiovasc_death_rate, log_diabetes_prevalence,
           log_hospital_beds_per_thousand, GHS_score)) 

# Appendix

spearman <- cor(scaled_data_cleaned, use="pairwise.complete.obs", method="spearman")

heatmap <- qtlcharts::iplotCorr(
  scaled_data_cleaned,
  reorder=TRUE,
  corr = spearman,
  chartOpts = list(cortitle = "Spearman's correlation")
)
country_slope = NULL
# Loop through the countries in the dataset and subset the required variables into another data frame
for (country in countries) {
  country_data <- policy_data[policy_data$location == country, ]
  
  country_policy <- country_data[!is.na(country_data$people_vaccinated), ] %>%
    group_by(vaccination_policy) %>%
    arrange(date) %>%
    slice(1L) %>%
    select(iso_code, location, date, vaccination_policy, new_vaccinations_smoothed, population, people_vaccinated)
  
# Apply the 'get_slope' function to calculate each country's speed, then store the result
  temp_slope <- get_slope(country_data, country_policy, country)

  country_slope <- rbind(country_slope, temp_slope)
}

country_slope$policy <- as.factor(country_slope$policy)
# Plot each country's vaccination speed by availability stage
global <- ggplot(data = country_slope, aes(x = country, y = speed, fill = policy)) + 
  geom_bar(stat="identity", position=position_dodge()) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
  scale_y_sqrt() + 
  ggtitle("All countries") + 
  xlab("Countries") + 
  ylab("Slope") + 
  labs(fill = "Availability stage")

ggplotly(global) %>% 
  layout(autosize = F, width = '100%', height = '100%')
# Remove population_density based on scatter plot and correlation heatmap

scaled_data_cleaned <- scaled_data_cleaned %>% select(-log_population_density)
# Hyperparameter grid search
## The importance results were produced on macOS
mtry <-  seq(2, 6, by = 1)
num_trees <- c(100,150,200,250,300,350,400,450,500)


# Manual Search
#control <- trainControl(method="cv", number=3, search="grid")
grid_param <- expand.grid(.mtry=mtry)
modellist <- list()
set.seed(3888)
for (ntree in num_trees) {
    fit <- train( vri~., 
                      data= scaled_data_cleaned, 
                      method='rf', 
                      tuneGrid=grid_param, 
                        ntree= ntree,
                        importance=TRUE,
                      trControl=trainControl(method='cv', 
                        number=5) )
    key <- toString(ntree)
    modellist[[key]] <- fit$finalModel

}


## COMPARE RESULTS

lowest_mse <- 1
model_mse <- modellist[[1]]
highest_r2 <- 0
model_r2 <- modellist[[1]]
  
for (i in c(1:length(modellist))) {

  result <- predict(modellist[[i]], newdata=scaled_data_cleaned %>% select(-vri))
  result_avg <- mean(scaled_data_cleaned$vri)
  mse = mean((scaled_data_cleaned$vri - result)^2)
  r2 = 1 - (sum((scaled_data_cleaned$vri - result)^2))/(sum((scaled_data_cleaned$vri - result_avg)^2))
  if (highest_r2 < r2){
     highest_r2 = r2
     model_r2 = modellist[[i]]
  }
  if (lowest_mse > mse) {
    lowest_mse = mse
    model_mse = modellist[[i]]
  }

 
}

# Calculating importance of features to the model: Handling multicollinear features by using conditional permutation

## https://cran.r-project.org/web/packages/permimp/vignettes/permimp-package.html

set.seed(3888)
optimal_rf <- randomForest(vri ~ ., data= scaled_data_cleaned, mtry = 2, keep.forest = TRUE, keep.inbag = FALSE, ntree= 300)
## compute permutation importance
rf_permimp <- permimp(optimal_rf, progressBar = FALSE, conditional = TRUE, scaled = TRUE, thresholdDiagnostics = TRUE, type = "response", do_check = FALSE)

vimp <- as.data.frame(rf_permimp$values)
colnames(vimp)[1] <- "importance"
vimp <- round(vimp, 4)
vimp$var <- row.names(vimp)

options(scipen=999)
vimp <- vimp %>%
  arrange(desc(importance)) %>%
  slice(1:5)
vimp <- vimp %>%
  ggplot(aes(reorder(var,importance), importance)) +
  geom_col(fill="steelblue") +
  coord_flip() +
  theme_bw()+
  ggtitle("Top 5 important variables") +
  xlab("Factors in order") +
  ylab("Scaled importance")


## The permimp() results can vary across operating systems, so we save the importance plot to file
# png(filename="importance.png")
# vimp
# dev.off()
# Merge the datasets
joint_data <- merge(vri_eva_data,corruption,  by.x=c("iso_code"), by.y=c("ISO3"))

joint_data <- merge(joint_data,happiness, by.x=c("iso_code"), by.y=c("Code"))

ghs <- ghs %>% mutate(
  iso_code = case_when(
    !is.na(iso_code) ~ iso_code,
    Country == "Bosnia and Hercegovina" ~ "BIH",
    Country == "Cabo Verde" ~ "CPV",
    Country == "Congo (Brazzaville)" ~ "COG",
    Country == "Congo (Democratic Republic)" ~ "COD",
    Country == "Côte d'Ivoire" ~ "CIV",
    Country == "eSwatini" ~ "SWZ",
    Country == "Kyrgyz Republic" ~ "KGZ",
    Country == "São Tomé and Príncipe" ~ "STP",
    Country == "St Kitts & Nevis" ~ "KNA",
    Country == "St Lucia" ~ "LCA",
    Country == "St Vincent & The Grenadines" ~ "VCT",
    Country == "United Kingdom" ~ "GBR",
    Country == "United States of America" ~ "USA"
  )
)


joint_data <- merge(joint_data, ghs[,-2],  by.x=c("iso_code"), by.y=c("iso_code"))
colnames(joint_data)[18] <- "GHS_score"
# take log for highly skewed variables
vri_extra_logged = joint_data %>% select(-population) %>%
  mutate(
    across(.cols = c(population_density, aged_65_older, gdp_per_capita, hospital_beds_per_thousand, cardiovasc_death_rate, extreme_poverty, diabetes_prevalence),
           ~log(.))
  ) %>% 
  rename_with(.cols = c(population_density, aged_65_older, gdp_per_capita, hospital_beds_per_thousand, cardiovasc_death_rate, extreme_poverty, diabetes_prevalence),
              .fn = ~paste0("log_", .))

min_max_norm <- function(x) {
    (x - min(x,na.rm = TRUE)) / (max(x,na.rm = TRUE) - min(x,na.rm = TRUE))
}

scaled_data <- data.frame(lapply(vri_extra_logged %>% select(-iso_code, -location, -vri4, -vri8, -vrif), min_max_norm), vri4 = vri_extra_logged$vri4, vri8 = vri_extra_logged$vri8, vrif = vri_extra_logged$vrif)

# Removing NAs:
scaled_data_cleaned <- scaled_data %>% filter(!is.na(vri4) &
                                                !is.na(vri8) &
                                                !is.na(vrif) &
                                                !is.na(log_population_density) &
                                                !is.na(log_gdp_per_capita) &
                                                !is.na(log_aged_65_older)&
                                                !is.na(log_extreme_poverty)&
                                                !is.na(human_development_index)&
                                                !is.na(log_cardiovasc_death_rate)&
                                                !is.na(log_hospital_beds_per_thousand)) %>%
  select(c(vri4, vri8, vrif, log_gdp_per_capita,log_aged_65_older,log_population_density,CPI.score.2020,
           log_extreme_poverty,satisfaction,life_expectancy,human_development_index,
           log_cardiovasc_death_rate,log_diabetes_prevalence,log_hospital_beds_per_thousand, GHS_score)) 


# Hyperparameter grid search (wider mtry range for the time-period models)
mtry <-  seq(2, 12, by = 1)
num_trees <- c(100,150,200,250,300,350,400,450,500)

timeperiod <- c()

# VRI between 0 to 4 months
# Manual Search
grid_param <- expand.grid(.mtry=mtry)
modellist <- list()
for (ntree in num_trees) {
    fit <- train( vri4~., 
                      data= scaled_data_cleaned %>% select(-vri8, -vrif), 
                      method='rf', 
                      tuneGrid=grid_param, 
                        ntree= ntree,
                      trControl=trainControl(method='cv', 
                        number=3) )
    key <- toString(ntree)
    modellist[[key]] <- fit$finalModel

}


## COMPARE RESULTS
lowest_mse_4 <- 1
model_mse_4 <- modellist[[1]]
highest_r2_4 <- 0
model_r2_4 <- modellist[[1]]

for (i in c(1:length(modellist))) {

  result <- predict(modellist[[i]], newdata=scaled_data_cleaned %>% select(-vri4, -vri8, -vrif))
  result_avg <- mean(scaled_data_cleaned$vri4)
  mse = mean((scaled_data_cleaned$vri4 - result)^2)
  r2 = 1 - (sum((scaled_data_cleaned$vri4 - result)^2))/(sum((scaled_data_cleaned$vri4 - result_avg)^2))
  if (highest_r2_4 < r2){
     highest_r2_4 = r2
     model_r2_4 = modellist[[i]]
  }
  if (lowest_mse_4 > mse) {
    lowest_mse_4 = mse
    model_mse_4 = modellist[[i]]
  }
 
}


# VRI between 4 to 8 months

# Manual Search
grid_param <- expand.grid(.mtry=mtry)
modellist <- list()
for (ntree in num_trees) {
    fit <- train( vri8~., 
                      data= scaled_data_cleaned %>% select(-vri4, -vrif), 
                      method='rf', 
                      tuneGrid=grid_param, 
                        ntree= ntree,
                      trControl=trainControl(method='cv', 
                        number=3) )
    key <- toString(ntree)
    modellist[[key]] <- fit$finalModel

}


## COMPARE RESULTS

lowest_mse_8 <- 1
model_mse_8 <- modellist[[1]]
highest_r2_8 <- 0
model_r2_8 <- modellist[[1]]

for (i in c(1:length(modellist))) {

  result <- predict(modellist[[i]], newdata=scaled_data_cleaned %>% select(-vri4, -vri8, -vrif))
  result_avg <- mean(scaled_data_cleaned$vri8)
  mse = mean((scaled_data_cleaned$vri8 - result)^2)
  r2 = 1 - (sum((scaled_data_cleaned$vri8 - result)^2))/(sum((scaled_data_cleaned$vri8 - result_avg)^2))
  if (highest_r2_8 < r2){
     highest_r2_8 = r2
     model_r2_8 = modellist[[i]]
  }
  if (lowest_mse_8 > mse) {
    lowest_mse_8 = mse
    model_mse_8 = modellist[[i]]
  }
 
}


# VRI between 8 months to latest date

# Manual Search
grid_param <- expand.grid(.mtry=mtry)
modellist <- list()
for (ntree in num_trees) {
    fit <- train( vrif~., 
                      data= scaled_data_cleaned %>% select(-vri8, -vri4), 
                      method='rf', 
                      tuneGrid=grid_param, 
                        ntree= ntree,
                      trControl=trainControl(method='cv', 
                        number=3) )
    key <- toString(ntree)
    modellist[[key]] <- fit$finalModel

}


## COMPARE RESULTS

lowest_mse_f <- 1
model_mse_f <- modellist[[1]]
highest_r2_f <- 0
model_r2_f <- modellist[[1]]

for (i in c(1:length(modellist))) {

  result <- predict(modellist[[i]], newdata=scaled_data_cleaned %>% select(-vri4, -vri8, -vrif))
  result_avg <- mean(scaled_data_cleaned$vrif)
  mse = mean((scaled_data_cleaned$vrif - result)^2)
  r2 = 1 - (sum((scaled_data_cleaned$vrif - result)^2))/(sum((scaled_data_cleaned$vrif - result_avg)^2))
  if (highest_r2_f < r2){
     highest_r2_f = r2
     model_r2_f = modellist[[i]]
  }
  if (lowest_mse_f > mse) {
    lowest_mse_f = mse
    model_mse_f = modellist[[i]]
  }
 
}


vri_evaluation <- data.frame(
  `Time period` = c("0 to 4 months", "4 to 8 months", "8 months to current"),
  MSE = c(lowest_mse_4, lowest_mse_8, lowest_mse_f),
  `R_squared` = c(highest_r2_4, highest_r2_8, highest_r2_f)
)

vri_eval_table <- vri_evaluation
sessionInfo()
# only example-plot code chunk